Good, bad, ugly feature extraction
So far, we have created a function that extracts “our” features from a single text. Now it is time to go over each row of the dataframe and apply the function to the plot summaries (a classic use case for loops). The first option that comes to mind is a for loop. Caution! This could get really ugly. TL;DR: TRY TO AVOID “FOR LOOPS” in R unless you are familiar with primitive functions or your dataset is small enough. I will briefly explain why in the next paragraph, but there is no heartbreak in skipping it.
optional According to Hadley Wickham’s Advanced R, “For loops in R have a reputation for being slow. Often that slowness is because you’re modifying a copy instead of modifying in place.” As an example, the function below applies a for loop to a dataframe. Since a dataframe is not a primitive type implemented in C, each iteration of the loop that updates the keyword and word_count columns creates a new COPY of sample (thanks to changes in R 3.1.0, the copy is no longer deep). Imagine rebuilding a wall from scratch in a new place every time you would like to add or change a brick. That makes the function really slow and ugly.
for_on_dataframe <- function(sample) {
  for (i in 1:nrow(sample))
    sample[i, c('keyword', 'word_count')] <- extract_text_features(sample[i, 2])
  return(sample)
}
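To actually see those copies happen, here is a minimal sketch using base R’s tracemem(), which reports every time an object is duplicated (it requires an R build with memory profiling, which the standard CRAN binaries have; the toy dataframe is mine, not the plot-summary data):

```r
# Sketch: watching copy-on-modify with base R's tracemem().
df <- data.frame(x = 1:5, y = letters[1:5])
tracemem(df)               # print a message whenever df is duplicated
for (i in 1:3) {
  df$x[i] <- df$x[i] * 2   # looks in-place, but can duplicate df
}
untracemem(df)             # stop tracing
df$x                       # 2 4 6 4 5 -- updated, at the cost of copies
```

Each `tracemem[...]` line printed inside the loop is one wall rebuilt from scratch.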
As discussed above, using for loops on dataframes can get really ugly as your data size increases. However, we may be able to improve performance by changing the data.frame to a list or by using apply-family functions. The apply functions in R generally don’t provide better performance than other loop constructs, although some resources state that lapply can be a little faster because it does more of its work in C code rather than in R. The chunk below includes two functions from the apply family, lapply and mapply (a multivariate form of sapply), plus a for loop over a list.
lapply_script <- function(sample) {
  sample[, c('keyword', 'word_count')] <- do.call(rbind, lapply(sample[[2]], extract_text_features))
  return(sample)
}
mapply_script <- function(sample) {
  # mapply simplifies to a 2-row matrix here, so transpose before assigning
  sample[, c('keyword', 'word_count')] <- t(mapply(extract_text_features, sample[, 2]))
  return(sample)
}
for_on_list <- function(sample) {
  sample <- as.list(sample)
  for (i in 1:length(sample[[1]])) {
    temp <- extract_text_features(sample[[2]][i])
    sample[['keyword']][i] <- temp[1]
    sample[['word_count']][i] <- temp[2]
  }
  return(sample)
}
Obviously, the extract_text_features tasks can be done independently on each plot summary. We may be able to speed up the process by parallelizing the tasks on a machine with multiple cores. If you would like to dig deeper into parallelization, please check out this awesome page. The code chunk below uses the doParallel library to implement parallel code and provide a parallel backend for the foreach package. In spite of its great advantages, parallelizing R processes has some challenges that we need to keep in mind. Again, feel free to skip the next paragraph if you are in a rush.
optional I have personally dealt with two main caveats in using parallel processes: a) Mapping the task across multiple cores and reducing the results back usually takes time and work. We only get a substantial speedup if the task (the extract_text_features function in this example) takes long enough for that overhead to be worth it. If you are interested in estimating the overhead, look at the difference between the elapsed and the user + system times of the current session in the five-element vector returned by proc.time() (at least on Unix-like OSes). The printed format of system.time() only includes the total execution time across the sessions on different cores and might not tell the detailed story. b) Accessing global variables and dealing with global state works differently than in single-threaded execution. Since each worker gets a copy of the vector we pass in, modifying the data in place simply modifies the copy instead of the original vector. To overcome data-handling issues across cores, we should set up foreach so that it returns the information from the parallelized task:
extracts <- foreach::foreach(i = 1:nrow(sample), .combine = rbind) %dopar% extract_text_features(sample[i, 2])
parallel_loop_script <- function(sample) {
  num_cores <- parallel::detectCores()
  doParallel::registerDoParallel(num_cores)
  extracts <- foreach::foreach(i = 1:nrow(sample), .combine = rbind) %dopar%
    extract_text_features(sample[i, 2])
  return(as.data.frame(extracts))
}
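As a rough way to feel out the overhead described in caveat (a), here is a minimal sketch using base R’s parallel package instead of doParallel (mclapply forks, so it only runs in parallel on Unix-alikes; slow_task is a hypothetical stand-in for extract_text_features):

```r
# Sketch: gauging parallel overhead with base R's parallel package.
library(parallel)

slow_task <- function(x) { Sys.sleep(0.2); x^2 }  # stand-in for extract_text_features

t_serial   <- system.time(lapply(1:4, slow_task))[['elapsed']]
t_parallel <- system.time(mclapply(1:4, slow_task, mc.cores = 2))[['elapsed']]

# t_serial is roughly 0.8 s (four tasks back to back); t_parallel is
# roughly 0.4 s plus fork/collect overhead on 2 workers. If slow_task
# were nearly free, that overhead could make t_parallel LARGER than
# t_serial -- parallelism only pays off for heavy enough tasks.
```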
Special (or Parallel) Effects
Now, let’s go ahead and assess the performance of each loop function. We categorized the loops into The Good, The Bad, and The Ugly. Although the logic behind good, bad, and ugly is independent of the data, a function could be ugly on one dataset and “good enough” on another.
plot_summary <- data.table::fread('../data/plot_summaries.txt', header = FALSE, quote = '')
plot_summary <- as.data.frame(plot_summary)
batch_size <- 4000
row_number <- seq(batch_size, nrow(plot_summary), by = batch_size)
# If foreach is better, let's change the behavior and use it while we can
doParallel::registerDoParallel(parallel::detectCores())
runtime_list <- foreach (i = 1:length(row_number)) %dopar% {
  for_dataframe <- system.time(for_on_dataframe(plot_summary[1:row_number[i], ]))
  lapply_method <- system.time(lapply_script(plot_summary[1:row_number[i], ]))
  mapply_method <- system.time(mapply_script(plot_summary[1:row_number[i], ]))
  for_list <- system.time(for_on_list(plot_summary[1:row_number[i], ]))
  for_parallel <- system.time(parallel_loop_script(plot_summary[1:row_number[i], ]))
  list(i, for_dataframe[[3]], lapply_method[[3]], mapply_method[[3]], for_list[[3]], for_parallel[[3]])
}
# change the list of lists to dataframe
runtime_df <- do.call(rbind.data.frame, runtime_list)
names(runtime_df) <- c('index', 'for_dataframe', 'lapply+do.call', 'mapply', 'for_list', 'parallel_loop')
# need to melt the dataframe to be able to line plot them
melted <- reshape2::melt(runtime_df, measure.vars = c('for_dataframe', 'lapply+do.call', 'mapply', 'for_list', 'parallel_loop'))
ggplot(melted, aes(batch_size * index, value, color = variable)) +
  geom_line() +
  xlab('Number of summary plots') + ylab('Execution time (seconds)') +
  ggtitle('Execution time of loop functions as the text dataset gets large')
As the figure above shows, the for loop on dataframes has the ugliest performance at any dataset size. The for loop on lists and the apply functions perform better, although there is not a big difference between them. lapply looks slightly worse here because do.call(rbind, ...) is also used to bind the results, so this does not rule out its possible slight advantage over the other apply-family functions. The parallel implementation provides the best performance, as we are using all cores of the machine (in this example, I am running the scripts on an 8-core MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of 2133 MHz DDR3 RAM). We have thoroughly discussed the reasons behind the performance of each set of implementations in Good, bad, ugly feature extraction.
Ending
We have discussed some techniques for changing our R programming habits for better performance. These methods can also help if you are dealing with memory issues on your machine, specifically by reducing the number of unnecessary duplicated objects in memory. Other techniques are available to speed up R script execution, such as vectorization, writing C++ functions with Rcpp, or using the parallel forms of the apply family (like mclapply). By applying parallel computing to the text dataset, we improved the performance of the text-feature extraction by about 6 times. That would feel awesome if we are going to use that function in a Shiny app or behind an API.
Parallelization is not an optimal solution for sequential processes or for jobs that already run really fast. If we are not sure whether parallelizing a specific piece of code improves its speed, we can always check the execution time using system.time() or a profiler. Taking advantage of primitive functions, vectorization, and C-based packages, however, will most probably speed up execution in R.
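To illustrate the vectorization point, here is a minimal sketch comparing an explicit (but preallocated) loop with the equivalent vectorized expression; the function names are mine:

```r
# Sketch: explicit loop vs vectorized arithmetic for the same result.
squares_loop <- function(x) {
  out <- numeric(length(x))            # preallocate instead of growing
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
squares_vec <- function(x) x^2         # one call into C-level arithmetic

x <- runif(1e6)
system.time(squares_loop(x))           # visibly slower: 1e6 R-level iterations
system.time(squares_vec(x))            # near-instant: one vectorized call
```

Both return identical results; the vectorized version simply pushes the loop down into C.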
I hope this article sheds some light on possible pitfalls of certain functions/habits in R (especially when the data gets larger and dirtier). I will share more as I discover my own ugly habits in R and Python (I am sure I have plenty of them).